CNN and Image Classification
When we learn from images, larger images lead to more parameters and more computation. But we don't actually need to look at the entire image at once; we want the incoming weights to focus on local patterns of the input image. To do this, we use the convolution operation to extract features from the image.
Denote $*$ as the convolution operation. Then we have:
- $x * w = w * x$ (commutativity)
- $x * (a w_1 + b w_2) = a (x * w_1) + b (x * w_2)$ (linearity)
The thing we convolve by is called a kernel or filter. We can use kernels to do blurring, sharpening, edge detection, etc.
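As a concrete illustration, here is a minimal NumPy sketch of the operation (square inputs and kernels for simplicity; like most deep learning libraries, it slides the kernel without flipping it, i.e. cross-correlation). The `conv2d` name and the edge-detection kernel are just illustrative choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (no padding, stride 1) for square arrays.

    Note: the kernel is not flipped (cross-correlation); for symmetric
    kernels this matches true convolution.
    """
    n = image.shape[0]
    k = kernel.shape[0]
    out = np.zeros((n - k + 1, n - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# A classic edge-detection (Laplacian-style) kernel.
edge_kernel = np.array([[0, -1, 0],
                        [-1, 4, -1],
                        [0, -1, 0]], dtype=float)

image = np.random.rand(8, 8)
print(conv2d(image, edge_kernel).shape)  # (6, 6): (8 - 3 + 1) on each side
```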
Convolution Layer
That is, we would like to add a convolution layer to our neural network.
Usually, when we have an $n \times n$ input image and a $k \times k$ kernel, we get an $(n - k + 1) \times (n - k + 1)$ output image. If we pad the input with $p$ zeros on each side, we get an $(n - k + 2p + 1) \times (n - k + 2p + 1)$ output image. If we additionally use stride $s$, the output image has side length $\left\lfloor \frac{n - k + 2p}{s} \right\rfloor + 1$ (see the sketch after the list below).
- padding: add $p$ pixels of zeros around the border of the input image.
- stride: the number $s$ of pixels between adjacent receptive fields in the horizontal and vertical directions.
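A minimal sketch of this output-size calculation (the helper name `conv_output_size` is just illustrative):

```python
def conv_output_size(n, k, p=0, s=1):
    """Output side length for an n x n input, k x k kernel, padding p, stride s."""
    return (n - k + 2 * p) // s + 1

print(conv_output_size(32, 5))            # 28: no padding, stride 1
print(conv_output_size(32, 5, p=2))       # 32: "same" padding
print(conv_output_size(32, 5, p=2, s=2))  # 16: stride 2 roughly halves the size
```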
Ways to measure the size of a network:
- Number of units. (activations in memory during training)
- Number of weights. (weights in memory during training; more parameters can lead to overfitting)
- Number of connections. (approximately 3 add-multiply operations per connection: 1 for the forward pass, 2 for backpropagation)
- A fully connected layer with $N$ input units and $M$ output units has $NM$ weights and $NM$ connections.
For more comparison, given the same input and output size ($W \times H$), define:
- $J$: number of output maps
- $I$: number of input maps
- $K \times K$: kernel size
- $H$: height of input
- etc.
Layer Type | Number of Weights | Number of Connections |
---|---|---|
Fully Connected | $W^2 H^2 I J + W H J$ | $W^2 H^2 I J + W H J$ |
Convolution | $I J K^2 + J$ | $W H I J K^2 + W H J$ |
The added term in each entry is the bias term.
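As a quick sanity check of the table, here is a small sketch that plugs arbitrary example sizes into these formulas:

```python
# Example sizes (arbitrary): 32x32 maps, 16 input maps, 32 output maps, 5x5 kernels.
W, H, I, J, K = 32, 32, 16, 32, 5

# Fully connected layer treating every (map, pixel) pair as a unit.
fc_weights = W * H * I * W * H * J + W * H * J
fc_connections = fc_weights  # each weight is used exactly once

# Convolution layer: weights are shared across spatial positions.
conv_weights = I * J * K**2 + J
conv_connections = W * H * I * J * K**2 + W * H * J

print(f"fully connected: {fc_weights:,} weights, {fc_connections:,} connections")
print(f"convolution:     {conv_weights:,} weights, {conv_connections:,} connections")
```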
Pooling Layer
These layers reduce the size of the representation and build in invariance to small transformations. Most commonly, we use max pooling, which computes the maximum value of the units in a pooling group: $y_i = \max_{j \in P_i} z_j$, where $P_i$ is the pooling group.
For an $n \times n$ input image with a $k \times k$ pooling kernel and stride $s$, we get an output image of side length $\left\lfloor \frac{n - k}{s} \right\rfloor + 1$.
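A minimal NumPy sketch of max pooling, assuming a square input and the common non-overlapping case where the stride equals the kernel size (the helper name is illustrative):

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max pooling (stride = k) over a square array."""
    out_n = x.shape[0] // k
    # Reshape into (out_n, k, out_n, k) blocks and take the max over each block.
    blocks = x[:out_n * k, :out_n * k].reshape(out_n, k, out_n, k)
    return blocks.max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x, 2))
# [[ 5.  7.]
#  [13. 15.]]
```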
Convolutional Neural Network
We can combine convolution layer and pooling layer to form a convolutional neural network.
- The convolution layer, with $J$ sets of kernels, produces a set of $J$ feature maps (each obtained by convolving the input with a different kernel).
- After the convolution layer (and before pooling), we would always like to apply a rectified linear nonlinearity (ReLU) to introduce nonlinearity into the network.
- Then we apply the pooling layer so that higher-layer filters can cover a larger region of the input than equal-sized filters in lower layers. (A minimal network combining these layers is sketched below.)
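A minimal sketch of such a conv/ReLU/pool stack, assuming PyTorch is available; the layer sizes are arbitrary and chosen for 1-channel 28x28 (MNIST-like) inputs:

```python
import torch
import torch.nn as nn

# Two conv/ReLU/pool blocks followed by a fully connected classifier.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, padding=2),  # 28x28 -> 28x28
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                          # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=5, padding=2),                          # 14x14 -> 14x14
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                                          # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),  # 10-way classification
)

x = torch.randn(1, 1, 28, 28)  # a batch containing one fake image
print(model(x).shape)          # torch.Size([1, 10])
```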
Moreover:
- Convolution layers are equivariant: if the input is translated, the output is translated by the same amount.
- We would like the network's predictions to be invariant, that is, translating the input should not change the prediction.
- Pooling layers provide invariance to small translations. (A small numerical check of equivariance is sketched below.)
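A small numerical sketch of the equivariance claim, using SciPy's `correlate2d` (which matches how deep learning libraries implement "convolution") and an arbitrary one-pixel shift:

```python
import numpy as np
from scipy.signal import correlate2d

image = np.random.rand(10, 10)
kernel = np.random.rand(3, 3)

# Shift the image down by one pixel (zeroing the new top row).
shifted = np.roll(image, shift=1, axis=0)
shifted[0, :] = 0

out = correlate2d(image, kernel, mode="valid")
out_shifted = correlate2d(shifted, kernel, mode="valid")

# Away from the boundary, the output of the shifted input equals the shifted
# output of the original input: the convolution layer is equivariant.
print(np.allclose(out_shifted[1:], out[:-1]))  # True
```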
Object recognition
Object recognition is the task of identifying objects in images. It's closely related to object detection, which is the task of locating objects in images.
Some useful dataset sources include MNIST, CIFAR-10, ImageNet, COCO, etc.